Advancing face detection efficiency: Utilizing classification networks for lowering false positive incidences

Convolutional neural networks (CNNs) have markedly advanced face detection, significantly improving accuracy and recall. Precision and recall remain pivotal for evaluating CNN-based detection models; however, there is a prevalent tendency to focus on improving true positive rates at the expense of addressing false positives. A key contributor to this imbalance is the scarcity of pseudo-face images in training and evaluation datasets. This deficiency impairs the regression capabilities of detection models, leading to numerous erroneous detections and inadequate localization. To address this gap, we augment the WIDERFACE dataset with a considerable number of pseudo-face images created by blending human and animal facial features, so that false positives can be better suppressed during training. Furthermore, we propose a new face detection architecture that incorporates a classification model into the conventional face detection pipeline to reduce the false positive rate and improve detection precision. Comparative analysis on WIDERFACE and other widely used datasets shows that our architecture achieves a lower false positive rate while preserving the true positive rate relative to existing state-of-the-art face detection models.
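The cascade idea this abstract describes can be sketched as a two-stage pipeline: a detector proposes boxes, and a secondary classifier rejects pseudo-faces before the metrics are computed. A minimal illustration in Python — the function names and the probability callback are hypothetical, not the paper's actual API:

```python
def precision_recall_fpr(tp, fp, fn, tn):
    """Standard detection metrics from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0  # false positive rate
    return precision, recall, fpr

def filter_detections(detections, face_prob, threshold=0.5):
    """Second stage: keep only boxes the classifier scores as real faces."""
    return [d for d in detections if face_prob(d) >= threshold]
```

For example, rejecting two of ten false positives with the second stage raises precision while leaving recall untouched, which is exactly the trade-off the abstract targets.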

A comprehensive review of explainable AI for disease diagnosis

Artificial intelligence (AI) is now used across several domains of the healthcare sector. Despite its effectiveness in healthcare settings, widespread adoption remains limited by the transparency issue, which is considered a significant obstacle. To earn the trust of end users, it is necessary to explain the output of AI models. Explainable AI (XAI) has therefore emerged as a potential solution by providing transparent explanations of model outputs. The primary aim of this review is to survey articles on machine learning (ML) or deep learning (DL) based human disease diagnosis in which the model's decision-making process is explained by XAI techniques. To that end, two journal databases (Scopus and the IEEE Xplore Digital Library) were thoroughly searched using a set of predetermined relevant keywords. The PRISMA guidelines were followed to select papers for the final analysis, and studies that did not meet the requirements were eliminated. Finally, 90 Q1 journal articles covering several XAI techniques were selected for in-depth analysis. The findings are then summarized, and responses to the proposed research questions are outlined. In addition, challenges related to XAI in human disease diagnosis and future research directions in this area are presented.

Open Access
BT-Net: An end-to-end multi-task architecture for brain tumor classification, segmentation, and localization from MRI images

Brain tumors are severe medical conditions that can prove fatal if not detected and treated early. Radiologists often use MRI and CT imaging to diagnose brain tumors early, but a shortage of skilled radiologists to analyze medical images can be problematic in low-resource healthcare settings. To overcome this issue, deep learning-based automatic analysis of medical images can be an effective tool for assistive diagnosis. Conventional methods generally focus on specialized algorithms that address a single task, such as segmentation, classification, or localization of brain tumors. In this work, we propose a novel multi-task network, built on a modified VGG16 backbone concatenated with a U-Net variant, that achieves segmentation, classification, and localization simultaneously within the same architecture. We trained the classification branch on the Brain Tumor MRI Dataset and the segmentation branch on the Brain Tumor Segmentation dataset. The integrated output of our method can aid in simultaneous classification, segmentation, and localization of four types of brain tumors in MRI scans. The proposed multi-task framework achieved 97% classification accuracy and a Dice similarity score of 0.86 for segmentation. In addition, the method shows higher computational efficiency than existing methods. Our method can be a promising tool for assistive diagnosis in low-resource healthcare settings where skilled radiologists are scarce.
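The Dice similarity score reported above measures overlap between a predicted and a ground-truth segmentation mask. A minimal sketch over flat binary masks — an illustration of the metric, not the paper's implementation:

```python
def dice_score(pred, target):
    """Dice similarity coefficient between two flat binary masks (lists of 0/1).

    Dice = 2|P ∩ T| / (|P| + |T|); returns 1.0 for two empty masks by convention.
    """
    inter = sum(p * t for p, t in zip(pred, target))  # overlapping foreground pixels
    total = sum(pred) + sum(target)
    return 2.0 * inter / total if total else 1.0
```

A score of 0.86, as reported, means predicted tumor regions overlap ground truth substantially more than they miss it.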

Open Access
Single-Stage Extensive Semantic Fusion for multi-modal sarcasm detection

With the rise of social media and online interactions, there is a growing need for analytical models capable of understanding the nuanced, multi-modal communication inherent in these platforms, especially for detecting sarcasm. Existing research employs multi-stage models with extensive semantic information extraction and single-modal encoders. These models often struggle to align and fuse multi-modal representations efficiently. Addressing these shortcomings, we introduce the Single-Stage Extensive Semantic Fusion (SSESF) model, designed to process multi-modal inputs concurrently in a unified framework that performs encoding and fusion in the same architecture with shared parameters. A projection mechanism overcomes the challenges posed by the diversity of inputs and the integration of a wide range of semantic information. Additionally, we design a multi-objective optimization that enhances the model's ability to learn latent semantic nuances through supervised contrastive learning. The unified framework emphasizes the interaction and integration of multi-modal data, while the multi-objective optimization preserves the complexity of semantic nuances for sarcasm detection. Experimental results on a public multi-modal sarcasm dataset demonstrate the superiority of our model, which achieves state-of-the-art performance. The findings highlight the model's capability to integrate extensive semantic information and its effectiveness in the simultaneous interpretation and fusion of multi-modal data for sarcasm detection.
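The supervised contrastive objective mentioned above pulls same-label embeddings together and pushes different-label embeddings apart. A simplified, pure-Python sketch of such a loss (in the style of standard SupCon, not the paper's exact objective; the batch layout and temperature are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def supcon_loss(embs, labels, tau=0.1):
    """Simplified supervised contrastive loss over a batch of embeddings.

    For each anchor, every same-label sample is a positive; all other
    samples form the softmax denominator.
    """
    n = len(embs)
    total, count = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue  # anchors without positives contribute nothing
        denom = sum(math.exp(cosine(embs[i], embs[j]) / tau)
                    for j in range(n) if j != i)
        for j in pos:
            total += -math.log(math.exp(cosine(embs[i], embs[j]) / tau) / denom)
            count += 1
    return total / count
```

When embeddings already cluster by label the loss is near zero; scrambling the labels makes it large, which is the gradient signal that shapes the shared representation space.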

Open Access
AFENet: Attention-guided feature enhancement network and a benchmark for low-altitude UAV sewage outfall detection

Inspecting sewage outfalls into rivers is important for precise management of the ecological environment, because outfalls are the last gate through which pollutants enter a river. Unmanned Aerial Vehicles (UAVs), with their maneuverability and high-resolution imagery, have become an important means of inspecting sewage outfalls. However, daily UAV inspections still rely on manual interpretation, and a corresponding low-altitude sewage outfall image dataset is lacking. Meanwhile, because sewage outfalls are sparsely distributed, problems such as scarce labeled samples, complex background types, and weak objects are also prominent. To promote the inspection of sewage outfalls, this paper proposes a low-altitude sewage outfall object detection dataset, UAV-SOD, and an attention-guided feature enhancement network, AFENet. The UAV-SOD dataset features high resolution, complex backgrounds, and diverse objects. Some outfall objects exhibit multi-scale variation, uniform color, and weak feature responses, leading to low detection accuracy. To localize these objects effectively, AFENet first uses a global context block (GCB) to jointly exploit valuable global and local information, and then a region of interest (RoI) attention module (RAM) to model the relationships between RoI features. Experimental results show that the proposed method improves detection performance on the UAV-SOD dataset over representative state-of-the-art two-stage object detection methods.
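The RoI attention idea — reweighting each region's features by their similarity to all other regions — can be sketched with plain dot-product attention. This is a rough stand-in for the RAM described above, not the paper's module; shapes and names are illustrative:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def roi_attention(rois):
    """Reweight each RoI feature vector by its similarity to every RoI.

    Each output vector is a similarity-weighted average of all RoI
    features, so related regions reinforce each other.
    """
    out = []
    for q in rois:
        scores = softmax([sum(a * b for a, b in zip(q, k)) for k in rois])
        out.append([sum(w * k[d] for w, k in zip(scores, rois))
                    for d in range(len(q))])
    return out
```

For weak, single-colored outfall objects the intuition is that context from similar RoIs sharpens an otherwise faint feature response.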

Open Access
Small group pedestrian crossing behavior prediction using temporal angular 2D skeletal pose

Pedestrians are classified as Vulnerable Road Users (VRUs) because they lack protective equipment, so an accident involving them can be fatal. Accidents can happen while a pedestrian is on the road, especially when crossing. To ensure pedestrian safety, it is necessary to understand and predict pedestrian behaviour when crossing the road. We propose pedestrian intention prediction using a 2D pose estimation approach with temporal joint angles as features. Based on visual observation of the Joint Attention in Autonomous Driving (JAAD) dataset, we found that pedestrians tend to wait and cross in small groups, which disband on the opposite side of the road. We therefore propose predicting over small groups of pedestrians; based on pedestrian statistics, we define a small group as four pedestrians. A further problem is that 2D pose estimation processes each pedestrian index individually, which creates ambiguous pedestrian indices across consecutive frames. We propose Multi Input Single Output (MISO), which processes multiple pedestrians together and uses a summation layer at the end of the model to resolve the ambiguous-index problem without tracking each pedestrian. Our proposed model achieves an accuracy of 0.9306 with a prediction performance of 0.8317.
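The temporal angular feature described above reduces to computing the angle at a joint from three 2D keypoints, then tracking that angle across frames. A minimal sketch — joint names and the frame layout are illustrative, not the paper's data format:

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by keypoints a-b-c,
    e.g. hip-knee-ankle from a 2D pose estimator."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

def temporal_angles(frames, triplet):
    """One joint triplet's angle across consecutive frames: the
    temporal angular feature fed to the behavior classifier."""
    i, j, k = triplet
    return [joint_angle(f[i], f[j], f[k]) for f in frames]
```

A straightening knee angle over successive frames, for instance, is the kind of temporal cue that distinguishes a pedestrian about to cross from one standing still.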

Open Access
An encrypted traffic identification method based on multi-scale feature fusion

As data privacy concerns grow, an increasing number of websites encrypt their traffic. Encryption largely protects privacy, but it also poses a major challenge for traffic classification. Aiming at the difficulty of obtaining a globally optimal solution in encrypted traffic classification, this paper proposes an encrypted traffic identification model based on multi-scale feature fusion: the ET-BERT and 1D-CNN fusion network (BCFNet). The method combines feature learning and classification in a unified end-to-end model. Local features of encrypted traffic, extracted by an improved Inception one-dimensional convolutional neural network, are fused with global features extracted by the ET-BERT model. A one-dimensional convolutional neural network is better suited to the one-dimensional sequences of encrypted traffic than the commonly used two-dimensional variant. The proposed model can learn the nonlinear relationship between input data and expected labels and is more likely to reach a globally optimal solution. Experiments on the ISCX VPN-nonVPN dataset compare BCFNet against five baseline models on accuracy, precision, recall, and F1. The results demonstrate that BCFNet outperforms the other five models overall, with accuracy reaching 98.88%.
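The fusion step above boils down to extracting local patterns with a 1-D convolution over the byte sequence and concatenating them with a global feature vector before the classifier head. A toy sketch, assuming hand-picked kernels and vectors (the real model learns these; names are illustrative):

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution: slides a local pattern detector over a
    one-dimensional traffic byte sequence."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def fuse(local_feats, global_feats):
    """Multi-scale fusion by concatenation: local (CNN-style) features
    joined with global (transformer-style) features."""
    return list(local_feats) + list(global_feats)
```

The concatenated vector is what a final classifier would see, letting it draw on byte-level patterns and sequence-level context at once.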
